In 03-shortest-curves-float32.ipynb, we have seen that using float32 instead of float64 can break the convergence of the ODE solver. In 05-ode-stability.ipynb, we have seen that omitting the computationally expensive fun_jac parameter is also likely to break convergence. Finally, in 07-moons-rbf.ipynb, we have seen that solving the geodesic ODE is unstable on the moons dataset: the solver does not converge and often diverges without finding even an approximate solution. It is therefore worth exploring alternative methods for finding a geodesic in the latent space w.r.t. the Riemannian metric. Specifically, we will implement Algorithm 1 of Shao (2017), which uses gradient descent to find a piecewise linear curve at a local minimum of the curve length / energy functional.
%load_ext autoreload
%autoreload 2
import numpy as np
import tensorflow as tf
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
# Set up plotly
init_notebook_mode(connected=True)
layout = go.Layout(
width=700,
height=500,
margin=go.Margin(l=40, r=40, b=20, t=20),
showlegend=False
)
config={'showLink': False}
# Make results completely deterministic
seed = 0
np.random.seed(seed)
tf.set_random_seed(seed)
We use the trained generator model from 04-shortest-curves-vae.ipynb. Note that this model only predicts the mean, while the variance was fixed to 0.1.
cone_generator = tf.keras.models.load_model('models/cone-generator.h5')
session = tf.keras.backend.get_session()
from src.plot import plot_3d_surface
points_per_axis = 100
latent_x = np.linspace(-2, 2, points_per_axis)
latent_y = np.linspace(-2, 2, points_per_axis)
surface = plot_3d_surface(latent_x, latent_y, cone_generator, layout)
Next, we plot the magnification factor in the latent space together with the start and end point of our geodesic, i.e. the task.
from src.plot import plot_magnification_factor
# Values from 04-shortest-curves-vae.ipynb
z_start = np.array([-1.7773801, -0.99523455])
z_end = np.array([1.872246, 0.8469684])
task_plot = go.Scatter(
x = np.array([z_start[0], z_end[0]]),
y = np.array([z_start[1], z_end[1]]),
mode = 'markers',
marker = {'color': 'red'}
)
heatmap = plot_magnification_factor(session, latent_x, latent_y,
cone_generator,
additional_data=[task_plot],
layout=layout,
log_scale=False)
Before implementing anything, we need a performance measure for a given curve. We are interested in its length, and we want curves with constant velocity in the VAE's input space. Therefore, we also measure the ratio between the curve's longest and shortest segment. This corresponds to the ratio between the highest and lowest velocity, since each segment covers an equal interval in $t$ (the curve is parametrized by $t$ going from $0$ to $1$).
def evaluate(generator, latent_curve):
    input_curve = generator.predict(latent_curve)
    lengths = []
    for start, end in zip(input_curve[:-1], input_curve[1:]):
        length = np.linalg.norm(end - start)
        lengths.append(length)
    length = np.sum(lengths)
    velocity_ratio = np.max(lengths) / np.min(lengths)
    print('Length: %f, Max velocity ratio: %f' % (length, velocity_ratio))
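To see what these two numbers mean, here is a toy run of the same computation with a hypothetical identity "generator" (IdentityGenerator is not part of the repository, only an illustration: it decodes latent codes to themselves):

```python
import numpy as np

class IdentityGenerator:
    """Hypothetical stand-in for the Keras generator: decodes codes to themselves."""
    def predict(self, z):
        return np.asarray(z)

# A curve whose second segment is twice as long as the first
curve = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
decoded = IdentityGenerator().predict(curve)
lengths = np.linalg.norm(np.diff(decoded, axis=0), axis=1)
print(lengths.sum())                  # total length: 3.0
print(lengths.max() / lengths.min())  # max velocity ratio: 2.0
```

A perfectly steady curve would have a velocity ratio of exactly 1.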
We can use equation 5 from Shao (2017) to compute the gradient at each step. For this, we need to evaluate the Jacobian and the generator output itself at each point of the piecewise linear curve.
from src.util import get_jacobian_fast as get_jacobian
def shao(session, z_start, z_end, generator, num_nodes, save_every=10):
    # Initialize the curve as a straight line between z_start and z_end
    nodes = np.linspace(0, 1, num_nodes)
    curve = z_start + np.outer(nodes, z_end - z_start)
    dt = 1 / (num_nodes - 1)
    # Get a graph op for the Jacobian
    generator_input = tf.expand_dims(generator.input[0], 0)
    generator_output = generator(generator_input)[0]
    jacobian = get_jacobian(generator_output, generator_input)
    jacobian = tf.squeeze(jacobian)
    learning_rate = 0.01
    epsilon = 0.001 * (num_nodes - 2)
    # Save the curve regularly
    iterations = [np.copy(curve)]
    step = 0
    gradient_sum = 1
    while gradient_sum > epsilon:
        gradient_sum = 0
        for i in range(1, num_nodes - 1):
            # Compute the update for point i
            jacobian_z = session.run(jacobian, feed_dict={
                generator.input: np.array([curve[i]])
            })
            g = generator.predict(curve[[i - 1, i, i + 1]])
            # This is equation 5 in Shao (2017)
            gradient = -1 / dt * jacobian_z.T.dot(g[2] - 2 * g[1] + g[0])
            curve[i] -= learning_rate * gradient
            gradient_sum += np.sum(np.square(gradient))
        step += 1
        if step % save_every == 0:
            iterations.append(np.copy(curve))
    print('Finished after %d steps' % step)
    return curve, iterations
We run the algorithm and visualize its iterations.
%%time
save_every = 20
curve, iterations = shao(session, z_start, z_end, cone_generator,
num_nodes=50, save_every=save_every)
evaluate(cone_generator, curve)
print('-' * 20)
from src.plot import plot_latent_curve_iterations
plot_latent_curve_iterations(iterations, [heatmap, task_plot], layout,
step_size=save_every)
For our purposes, we cannot use the algorithm above as-is: it evaluates the generator's output at specific points, but our generator is stochastic. This is why Arvanitidis (2017) defines the Riemannian metric differently from Shao (2017), and we can only use the metric tensor.
While Shao et al. compute the energy of each curve segment exactly via
$\dfrac{1}{\delta t} \Vert g(z_{i+1}) - g(z_{i}) \Vert^2$,
we need to approximate this by computing
$\delta t \, \dot{\gamma}(t)^T G_{\gamma(t)} \dot{\gamma}(t)$,
similar to the integrand of the energy functional in equation 2 of Shao (2017). This only depends on the velocity $\dot{\gamma}(t)$ and the metric $G_{\gamma(t)}$, both of which we can compute.
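In the discrete setting the two expressions agree exactly whenever the generator is linear, since then the Jacobian is constant along the segment and $G = J^T J$. A quick NumPy sanity check (the linear generator $g(z) = Az$ is an assumption made only for this check):

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(3, 2)                  # linear "generator" g(z) = A z
G = A.T @ A                          # induced metric, constant for a linear map

z0, z1 = rng.randn(2), rng.randn(2)  # one segment of the discrete curve
dt = 0.1
v = (z1 - z0) / dt                   # segment velocity

exact = np.linalg.norm(A @ z1 - A @ z0) ** 2 / dt  # exact segment energy (up to 1/2)
approx = dt * v @ G @ v                            # metric-based approximation
assert np.isclose(exact, approx)
```

For a nonlinear generator the metric varies along the segment, which is why evaluating it at the segment midpoint below improves the estimate.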
We can compute both the energy and the length of a curve given the mean generator and the std generator in TensorFlow:
from src.util import get_metric_batch_op
def get_energy_op(curve, mean_generator, std_generator=None):
    dt = 1 / tf.cast(tf.shape(curve)[0] - 1, curve.dtype)
    # Get the starts and ends of the curve's segments
    starts = curve[:-1]
    ends = curve[1:]
    velocities = (ends - starts) / dt
    # Evaluating the metric in the middle of each segment gives the best
    # estimates
    middles = starts + 0.5 * (ends - starts)
    metrics = get_metric_batch_op(middles, mean_generator, std_generator)
    # energy = 0.5 * dt * velocity dot metric dot velocity
    energies = tf.matmul(tf.expand_dims(velocities, 1), metrics)
    energies = tf.matmul(energies, tf.expand_dims(velocities, 2))
    energies = 0.5 * dt * energies
    energy = tf.reduce_sum(energies)
    return energy
def get_length_op(curve, mean_generator, std_generator=None):
    dt = 1 / tf.cast(tf.shape(curve)[0] - 1, curve.dtype)
    # Get the starts and ends of the curve's segments
    starts = curve[:-1]
    ends = curve[1:]
    velocities = (ends - starts) / dt
    # Evaluating the metric in the middle of each segment gives the best
    # estimates
    middles = starts + 0.5 * (ends - starts)
    metrics = get_metric_batch_op(middles, mean_generator, std_generator)
    # length = dt * sqrt(velocity dot metric dot velocity)
    lengths = tf.matmul(tf.expand_dims(velocities, 1), metrics)
    lengths = tf.matmul(lengths, tf.expand_dims(velocities, 2))
    lengths = dt * tf.sqrt(lengths)
    length = tf.reduce_sum(lengths)
    # Also return the maximum ratio between lengths of segments
    return length, tf.reduce_max(lengths) / tf.reduce_min(lengths)
Now that we have ops for computing the length and energy of a curve, we can create the curve as a variable in TensorFlow and minimize the energy.
def find_geodesic_discrete(session, z_start, z_end, mean_generator,
                           std_generator=None, num_nodes=20, save_every=10,
                           learning_rate=0.001, max_steps=200, log_every=20,
                           reg=0.0):
    dtype = mean_generator.input.dtype
    latent_dim = mean_generator.input.shape.as_list()[1]
    # Initialize the curve as a straight line
    nodes = np.linspace(0, 1, num_nodes)
    curve_init = z_start + np.outer(nodes, z_end - z_start)
    # Create the curve variables; only the middle points are trainable
    curve_start = tf.constant(z_start, dtype=dtype, shape=[1, latent_dim])
    curve_end = tf.constant(z_end, dtype=dtype, shape=[1, latent_dim])
    curve_middle = tf.Variable(curve_init[1:-1], dtype=dtype)
    curve = tf.concat([curve_start, curve_middle, curve_end], axis=0)
    # Log the curve length regularly
    length, vel_ratio = get_length_op(curve, mean_generator, std_generator)
    # Optimize the energy
    energy = get_energy_op(curve, mean_generator, std_generator)
    loss = energy
    if reg > 0:
        loss += reg * vel_ratio
    optimizer = tf.train.AdamOptimizer(learning_rate)
    optimize = optimizer.minimize(loss, var_list=[curve_middle])
    # Initialize the graph variables
    session.run(curve_middle.initializer)
    session.run([v.initializer for v in optimizer.variables()])
    # Log the stats for the initial euclidean interpolation
    iterations = [session.run(curve)]
    current_length, current_energy, current_ratio = session.run(
        [length, energy, vel_ratio])
    print('Step 0, Length %f, Energy %f, Max velocity ratio %f' %
          (current_length, current_energy, current_ratio))
    # Optimize the curve energy
    for step in range(1, max_steps + 1):
        session.run(optimize)
        if step % save_every == 0:
            iterations.append(session.run(curve))
        if log_every > 0 and step % log_every == 0:
            current_length, current_energy, current_ratio = session.run(
                [length, energy, vel_ratio])
            print('Step %d, Length %f, Energy %f, Max velocity ratio %f' %
                  (step, current_length, current_energy, current_ratio))
    return session.run(curve), iterations
The algorithm in Shao (2017) updates each point of the curve separately while keeping the other points fixed. In TensorFlow, this would produce a large graph, since we would have to create a separate tf.train.AdamOptimizer for each curve point.
We can see that updating all curve points in the same optimizer step works just as well and is much faster than the original algorithm above (600 vs. 18 complete curve updates per second). Therefore, we can run it for more steps, yielding a slightly shorter curve with steady velocity (1.1 vs. 3.8 velocity ratio).
We can also see that minimizing the energy of the curve leads to a steady velocity, as stated in Shao (2017).
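This is a standard fact: by the Cauchy-Schwarz inequality, $L^2 \le 2E$ for any curve, with equality exactly at constant speed, so an energy minimizer also evens out the velocity. A toy numerical check (with a Euclidean metric and hand-picked speeds, purely for illustration):

```python
import numpy as np

dt = 1 / 3                                 # three segments on [0, 1]
speeds_uneven = np.array([1.0, 2.0, 3.0])  # per-segment speeds sqrt(v^T G v)
speeds_even = np.full(3, speeds_uneven.mean())

def length(s):
    return np.sum(dt * s)

def energy(s):
    return np.sum(0.5 * dt * s ** 2)

# Both speed profiles trace the same total length, but the constant-speed
# profile has strictly lower energy, and it attains L^2 = 2E exactly
assert np.isclose(length(speeds_uneven), length(speeds_even))
assert energy(speeds_even) < energy(speeds_uneven)
assert np.isclose(length(speeds_even) ** 2, 2 * energy(speeds_even))
```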
%%time
curve, iterations = find_geodesic_discrete(session, z_start, z_end,
cone_generator, num_nodes=100,
learning_rate=0.01,
max_steps=2000,
save_every=100,
log_every=200)
evaluate(cone_generator, curve)
print('-' * 20)
plot_latent_curve_iterations(iterations, [heatmap, task_plot], layout,
step_size=100)
After getting the discrete algorithm to work for the stochastic Riemannian metric, we can also try it out on the moons dataset, where the ODE solver failed to converge (see 07-moons-rbf.ipynb).
First, we load the generator from the VAE trained on the moons dataset. It also estimates the variance of the reconstructions using the RBF layer described in Arvanitidis (2017).
from tensorflow.python.keras import Model
from tensorflow.python.keras.layers import Lambda
from src.util import wrap_model_in_float64
from src.rbf import RBFLayer
# Load the model
with tf.keras.utils.CustomObjectScope({'RBFLayer': RBFLayer}):
    moons_generator = tf.keras.models.load_model('models/moons-generator.h5')
# Get the mean and std predictors
_, mean, var = moons_generator.output
sqrt_layer = Lambda(tf.sqrt)
dec_mean = Model(moons_generator.input, mean)
dec_std = Model(moons_generator.input, sqrt_layer(var))
# Wrap the models in float64, since the std jacobian can become very large
# with rbf kernels
dec_mean = wrap_model_in_float64(dec_mean)
dec_std = wrap_model_in_float64(dec_std)
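The float64 wrapping matters because RBF-based variance estimates involve terms like $e^{-\beta r^2}$ that underflow far from the data, and dividing by such vanished terms makes the std Jacobian blow up much earlier in float32 than in float64. A small illustration (the exponent is illustrative, not taken from the model):

```python
import numpy as np

# exp(-110) is ~1.7e-48: representable in float64, but below the smallest
# positive float32 subnormal (~1.4e-45), so it flushes to exactly zero.
x = 110.0
assert np.exp(np.float64(-x)) > 0.0
assert np.float32(np.exp(np.float32(-x))) == 0.0
```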
Again, we plot the magnification factor together with the two points between which we want to find the geodesic.
z_start = np.array([0.13, 1.28])
z_end = np.array([0.63, -0.42])
task_plot = go.Scatter(
x = np.array([z_start[0], z_end[0]]),
y = np.array([z_start[1], z_end[1]]),
mode = 'markers',
marker = {'color': 'red'}
)
heatmap_z1 = np.linspace(-4, 4, 200)
heatmap_z2 = np.linspace(-4, 4, 200)
heatmap = plot_magnification_factor(session,
heatmap_z1,
heatmap_z2,
dec_mean,
dec_std,
log_scale=True,
scale='hotcold',
additional_data=[task_plot],
layout=layout)
%%time
from src.discrete import find_geodesic_discrete
curve, iterations = find_geodesic_discrete(session, z_start, z_end,
dec_mean, std_generator=dec_std,
num_nodes=100,
learning_rate=0.001,
max_steps=5000,
save_every=500,
log_every=500)
print('-' * 20)
plot_latent_curve_iterations(iterations, [heatmap, task_plot], layout,
step_size=500)
This is clearly not the geodesic, and minimizing the energy does not lead to a steady velocity this time. The problem is the initialization of the curve: since the discrete geodesic algorithm performs gradient descent on the curve energy, it would have to climb over the hill of high variance to reach the valley on the right, which gradient descent cannot do.
It is worth noting that this is a synthetic dataset, and it is not clear whether we would experience such problems on, for example, MNIST as well. Nevertheless, we can tackle the problem here by using different curve initializations, for example Bezier curves:
import bezier
bezier_coefficients = [-1, 0, 1]
plot_data = []
for coeff in bezier_coefficients:
    direction = z_end - z_start
    # Construct the third point of the Bezier polygon by going orthogonally
    # from the middle point
    middle = z_start + 0.5 * direction
    orthogonal = np.array([-direction[1], direction[0]])
    # Construct the Bezier curve
    nodes = np.array([
        z_start,
        middle + coeff * orthogonal,
        z_end
    ]).T
    curve = bezier.Curve(nodes, degree=2)
    eval_curve = curve.evaluate_multi(np.linspace(0, 1, 50)).T
    # Plot the Bezier curve
    curve_plot = go.Scatter(
        x=eval_curve[:, 0],
        y=eval_curve[:, 1],
        mode='lines+markers',
        line={'width': 1, 'color': '#00DD00'},
        marker={'size': 5, 'color': '#00DD00'}
    )
    plot_data.append(curve_plot)
data = [heatmap, task_plot] + plot_data
iplot(go.Figure(data=data, layout=layout), config=config)
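For reference, a degree-2 Bezier curve like the ones above is just the quadratic Bernstein polynomial of its three control points, which we could also evaluate by hand (this helper is only a sketch, not used by the notebook):

```python
import numpy as np

def quadratic_bezier(p0, p1, p2, t):
    # Bernstein form of a degree-2 Bezier curve with control polygon (p0, p1, p2)
    t = np.asarray(t)[:, None]
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2

p0, p2 = np.array([0.0, 0.0]), np.array([2.0, 0.0])
p1 = np.array([1.0, 1.0])  # control point off the chord, like coeff * orthogonal
pts = quadratic_bezier(p0, p1, p2, np.linspace(0, 1, 5))
assert np.allclose(pts[0], p0) and np.allclose(pts[-1], p2)  # endpoints interpolated
assert np.isclose(pts[2, 1], 0.5)  # at t=0.5 the curve reaches half of p1's height
```

The control point itself is not interpolated; it only pulls the curve sideways, which is exactly how the coefficient bends the initialization around the high-variance region.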
The rightmost Bezier curve looks like a promising initialization. It was generated with coeff=1 in bezier_coefficients above. Let's try it out:
%%time
from src.discrete import find_geodesic_discrete_bezier
curve, iterations = find_geodesic_discrete_bezier(session, z_start, z_end,
dec_mean,
std_generator=dec_std,
num_nodes=100,
learning_rate=0.001,
max_steps=5000,
save_every=500,
log_every=500,
bezier_coeff=1.0)
print('-' * 20)
plot_latent_curve_iterations(iterations, [heatmap, task_plot], layout,
step_size=500)
Now the algorithm converges to a small energy and a small length while steadily evening out the velocity. The solution might not look like a geodesic, especially compared to the initial Bezier curve. However, comparing the length, energy, and maximum segment velocity ratio of the initial and final curves shows that the solution is clearly closer to being a geodesic:
Initial Bezier curve:
Final curve: